POLI 572B

Michael Weaver

February 11, 2019

Plan for Today

Review

  1. Questions on PS 3/ Last Lecture
  2. Least Squares and Dummy Variables

Statistical Models and Least Squares

  1. Statistical Model for Least Squares
  2. Assumptions
  3. Unbiasedness
  4. Variance

Questions

Questions on Problem Set 3

Questions on Least Squares Lecture 2

Least Squares and Dummy Variables

Multivariate Least Squares:

Dummy Variables

What are they?

Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise

How do we make them?

  • Easy to do in R with as.factor()

lm(y ~ as.factor(category_var))

Dummy Variables

What do they do?

  • They estimate a new intercept (mean) for some group in the data. (Why is this the case?)

Dummy Variables | 2 categories

summary(lm(INCEARN ~ sex, acs_data))$coefficients
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 142586.6   2360.005 60.41793 0.000000e+00
## sexMale      61122.7   2879.125 21.22961 7.209473e-98

Dummy Variables | 2 categories

  • We can’t have all categories and the intercept.
  • Why?
  • For every observation, Female + Male = 1. We estimate the intercept by including a column of 1s, so the two dummy columns and the intercept column are linearly dependent; we cannot include all three.
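One way to see this linear dependence directly (using a hypothetical toy vector, since acs_data is not shown here):

```r
# Toy data standing in for acs_data (hypothetical)
sex <- factor(c("Female", "Male", "Male", "Female"))

# Including the intercept *and* both dummies creates linear dependence:
X <- cbind(intercept = 1,
           female = as.numeric(sex == "Female"),
           male   = as.numeric(sex == "Male"))
all(X[, "female"] + X[, "male"] == X[, "intercept"])  # TRUE for every row

# model.matrix() avoids this by dropping the first level automatically
model.matrix(~ sex)
```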

Dummy Variables | 2 categories

We can include all categories and drop the intercept

summary(lm(INCEARN ~ -1 + sex, acs_data))$coefficients
##           Estimate Std. Error   t value Pr(>|t|)
## sexFemale 142586.6   2360.005  60.41793        0
## sexMale   203709.3   1649.163 123.52281        0

Dummy Variables | 2 categories

How does our interpretation change?

##           Estimate Std. Error   t value Pr(>|t|)
## sexFemale 142586.6   2360.005  60.41793        0
## sexMale   203709.3   1649.163 123.52281        0

Dummy Variables | 5 categories

We can consider a case where there are several categories:

Brexit Vote: What are regional differences in vote for Brexit?

Five regions: Southern England, Midlands, Northern England, Wales, and Scotland

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ Pct_Trn +  Region5, brexit))$coefficients
##                             Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)              44.61020359  7.7829764  5.7317665 2.048832e-08
## Pct_Trn                   0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5The Midlands       8.57606186  1.2532253  6.8431923 3.179826e-11
## Region5Northern England   6.35984691  1.3248565  4.8004045 2.293393e-06
## Region5Wales              2.30867033  2.0441434  1.1294072 2.594499e-01
## Region5Scotland         -11.60364022  1.8478179 -6.2796448 9.422928e-10

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                             Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)              53.18626546  7.7855372   6.8314188 3.420699e-11
## Pct_Trn                   0.08933588  0.1027488   0.8694588 3.851539e-01
## Region5Northern England  -2.21621496  1.5559745  -1.4243260 1.551859e-01
## Region5Wales             -6.26739154  2.2030506  -2.8448696 4.687357e-03
## Region5Scotland         -20.17970209  2.0149071 -10.0152022 4.494699e-21
## Region5Southern England  -8.57606186  1.2532253  -6.8431923 3.179826e-11

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                             Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)              50.97005050  7.3635671  6.9219239 1.946734e-11
## Pct_Trn                   0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5Wales             -4.05117658  2.1752891 -1.8623624 6.333598e-02
## Region5Scotland         -17.96348713  1.9097366 -9.4062643 5.303722e-19
## Region5Southern England  -6.35984691  1.3248565 -4.8004045 2.293393e-06
## Region5The Midlands       2.21621496  1.5559745  1.4243260 1.551859e-01

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                             Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)              46.91887392  7.6346802  6.1454930 2.043370e-09
## Pct_Trn                   0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5Scotland         -13.91231055  2.4939139 -5.5785047 4.660849e-08
## Region5Southern England  -2.30867033  2.0441434 -1.1294072 2.594499e-01
## Region5The Midlands       6.26739154  2.2030506  2.8448696 4.687357e-03
## Region5Northern England   4.05117658  2.1752891  1.8623624 6.333598e-02

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ Pct_Trn +  Region5, brexit))$coefficients
##                            Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)             33.00656337  7.2248567  4.5684731 6.681290e-06
## Pct_Trn                  0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5Southern England 11.60364022  1.8478179  6.2796448 9.422928e-10
## Region5The Midlands     20.17970209  2.0149071 10.0152022 4.494699e-21
## Region5Northern England 17.96348713  1.9097366  9.4062643 5.303722e-19
## Region5Wales            13.91231055  2.4939139  5.5785047 4.660849e-08

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ -1 + Pct_Trn +  Region5, brexit))$coefficients
##                            Estimate Std. Error   t value     Pr(>|t|)
## Pct_Trn                  0.08933588  0.1027488 0.8694588 3.851539e-01
## Region5Scotland         33.00656337  7.2248567 4.5684731 6.681290e-06
## Region5Southern England 44.61020359  7.7829764 5.7317665 2.048832e-08
## Region5The Midlands     53.18626546  7.7855372 6.8314188 3.420699e-11
## Region5Northern England 50.97005050  7.3635671 6.9219239 1.946734e-11
## Region5Wales            46.91887392  7.6346802 6.1454930 2.043370e-09

Dummy Variables | 5 categories

In R: you can shift the reference category using the following:

levels = c('Scotland', 'Southern England', 'The Midlands', 'Northern England', 'Wales')
brexit$Region5 = factor(brexit$Region5, levels = levels)

The first level becomes the reference category and is dropped (in this case, Scotland)

Regression and Statistical Inference

Key Points:

  1. To make statistical inferences, we need a statistical model that corresponds to the regression model
  2. As with experiments (e.g. the Neyman causal model), statistical models
  • treat observed data as realizations of random variables (some chance process)
  • connect what we observe to unobserved parameters
  3. We need to identify what is being done by
  • the data
  • the regression mathematics
  • the assumptions of the statistical model

An example: Hooke’s Law

Hooke’s law is a law of physics that states that the force (\(F\)) needed to extend or compress a spring by some distance \(x\) scales linearly with respect to that distance.

So the length of a spring is linearly proportional to the force applied. For a given weight \(X_i\) on a spring, the length \(Y_i\) is:

\[Y_i = \alpha + \beta X_i + \epsilon_i\] for \(i\) in \(1 \ldots n\)

An example: Hooke’s Law

\[Y_i = \alpha + \beta X_i + \epsilon_i\] for \(i\) in \(1 \ldots n\)

This is a statistical model:

  • \(\alpha\) and \(\beta\) are unknown parameters (thus, greek letters, not \(a\) and \(b\))
  • \(\epsilon_i\) is independent and identically distributed with mean \(0\) and variance \(\sigma^2\) (another parameter)

We use observed data to estimate the parameters \(\alpha, \beta, \sigma^2\)
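A small simulation of this model (the parameter values below are made up for illustration): least squares recovers \(\alpha\), \(\beta\), and \(\sigma^2\) from the simulated data.

```r
# Simulate Hooke's law data (parameter values are illustrative)
set.seed(1)
n      <- 100
weight <- runif(n, 0, 10)                              # X_i: weight on the spring
spring <- 0.439 + 0.05 * weight + rnorm(n, sd = 0.01)  # Y_i: observed length

fit <- lm(spring ~ weight)
coef(fit)              # estimates alpha-hat and beta-hat
summary(fit)$sigma^2   # estimate of sigma^2
```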

The Regression Model:

The statistical model for the regression makes certain assumptions about the process generating the data.

To find a line, we create a statistical model for a linear equation that describes the process generating the data.

\[Y_i = \alpha + \beta X_i + \epsilon_i\]

deterministic (not random) \(\alpha + \beta X_i\)

stochastic (random error) \(\epsilon_i\)

The Regression Model:

\(\alpha\): parameter for the intercept (fixed for all cases)

\(\beta\): parameter for the slope (fixed for all cases)

\(X_i\): value of independent variable (input into the equation)

\(Y_i\): value of dependent variable (determined by the equation)

\(\epsilon_i\): random error for observation \(i\)

  • \(\epsilon\) is independent and identically distributed
  • This means each error is a random draw from a distribution with mean \(0\) and variance \(\sigma^2\).
  • i.i.d. implies that random draws are independent of each other and come from the same distribution every time.

x = rnorm(100)
y = 3 + x * 0.5    # deterministic part only: no error term
plot(x, y, main = bquote(Y[i] == alpha + beta * X[i]))
abline(a = 3, b = 0.5)

y_star = 3 + x * 0.5 + rnorm(100)    # add i.i.d. random errors
plot(x, y_star, main = bquote(Y[i] == alpha + beta * X[i] + epsilon[i]))
abline(a = 3, b = 0.5)

The Regression Model:

This regression model:

\[Y_i = \alpha + \beta X_i + \epsilon_i\]

is linked to the least-squares calculations we did the last two weeks. The regression model can be estimated with least squares:

\[Y_i = \hat{\alpha} + \hat{\beta} X_i + e_i\]

where \(Var(e) = \widehat{\sigma^2}\)

Differences:

  • contrast parameters and estimates (\(\widehat{}\))
  • contrast \(\epsilon_i\) and \(e_i\).

The Regression Model:

Another key assumption: (if \(X\) is also a random variable) \(X \mathrel{\unicode{x2AEB}} \epsilon\)

  • This means \(X\) is independent of \(\epsilon\) (recall dependence versus independence)
  • independence is not orthogonality
  • If this were not true (in the Hooke’s law example):

If we put a heavier weight (large \(X_i\)) and that increased the likelihood of a large \(\epsilon_i\), while a lighter weight (small \(X_i\)) increased the likelihood of a smaller \(\epsilon_i\), then we would over-estimate \(\beta\).

Model Assumptions:

  1. Model equation is correctly specified (correct functional forms, all relevant variables are included). \(Y_i = \alpha + \beta X_i + \epsilon_i\) is the true data-generating function

  2. \(\epsilon_i\) are independently, identically distributed (i.i.d.), \(E(\epsilon_i) = 0\) and variance of \(\sigma^2\). homoskedastic vs. heteroskedastic

  3. \(\epsilon_i\) is independent of \(X_i\): \(X \mathrel{\unicode{x2AEB}} \epsilon\)

We cannot empirically verify these assumptions.

When using least squares to estimate this model, we mathematically impose \(e_i\) to be orthogonal to \(x_i\) and \(E(e_i) = 0\). But this is by assumption, not a fact.

Model Assumptions:

If these assumptions are a good model for the data we observe, then with large \(n\), we can estimate our parameters \(\alpha, \beta, \sigma^2\) using least squares.

The Regression Model:

In the multivariate situation, the statistical model for regression is:

\[Y = \mathbf{X\beta} + \epsilon\]

  • As when we covered least squares: \(Y_{n \times 1}\), \(\mathbf{X}_{n \times p}\) (called the design matrix), and \(\mathbf{\beta}_{p \times 1}\).
  • Now we have \(\epsilon_{n \times 1}\)

Each row of this becomes:

\[Y_i = \mathbf{X_i\beta} + \epsilon_i\]

The Regression Model:

Statistical Assumptions (statistical model):

  • \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  • \(\epsilon_i\) are independently, identically distributed, with variance \(\sigma^2\)
  • \(X \mathrel{\unicode{x2AEB}} \epsilon\)

Estimating the Regression Model

The Regression Model:

Mathematical Assumptions (least squares):

  • design matrix has full rank (no linear dependence b/t columns)
  • \(n \geq p\)

Statistical Assumptions (statistical model):

  • \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  • \(\epsilon_i\) are independently, identically distributed, with variance \(\sigma^2\)
  • \(X \mathrel{\unicode{x2AEB}} \epsilon\)

Estimation:

We want to estimate \(p + 1\) parameters… (what are they?)

\(\beta\) estimated by \(\widehat{\beta}\) using least squares:

\[\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]

  • Note: \(\widehat{\beta}\) is a random vector; each element is a random variable. (Why? Recall \(Y = \mathbf{X\beta} + \epsilon\).)

  • Under the statistical assumption we made, \(\widehat{\beta}\) is an unbiased estimator of \(\beta\), conditional on \(X\): \(E(\widehat{\beta}|X) = \beta\)
    • “conditional on \(X\)” means taking the values of \(X\) as fixed

LS estimate is unbiased:

By assumption: \(Y = \mathbf{X\beta} + \epsilon\), so:

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X\beta} + \epsilon)\)

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

LS estimate is unbiased:

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

So:

\(E(\widehat{\beta}|X) = E(\mathbf{\beta}|\mathbf{X}) + E((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon|\mathbf{X})\)

\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon|\mathbf{X})\)

And because, by assumption, \(X \mathrel{\unicode{x2AEB}} \epsilon\)

\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon)\)

And because, by assumption, \(E(\epsilon) = 0\)

\(E(\widehat{\beta}|X) = \mathbf{\beta}\)
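This result can be checked by simulation (a sketch with made-up parameter values): holding \(\mathbf{X}\) fixed and redrawing \(\epsilon\) many times, the average of \(\widehat{\beta}\) should be very close to \(\beta\).

```r
# Simulation check of E(beta_hat | X) = beta (parameter values illustrative)
set.seed(42)
n <- 50
true_beta <- c(1, 2)                 # alpha = 1, beta = 2
X <- cbind(1, runif(n))              # design matrix, held fixed throughout

beta_hats <- replicate(5000, {
  eps <- rnorm(n)                    # i.i.d., mean 0, independent of X
  Y   <- X %*% true_beta + eps
  as.vector(solve(t(X) %*% X) %*% t(X) %*% Y)
})
rowMeans(beta_hats)                  # approximately c(1, 2)
```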

Variance of OLS Estimator:

Thus, if our model assumptions are correct, \(\widehat{\beta}\) is an unbiased estimate of parameter \(\beta\)

  • \(\widehat{\beta}\) is a random vector, as it depends on chance draws of \(\epsilon\)
  • As \(n \to \infty\), \(\widehat{\beta} \approx \beta\), but there is still uncertainty.

We need to find a way of finding the uncertainty in our estimate \(\widehat{\beta}\)

Variance of our Estimate

Variance of OLS Estimator:

  • Each element of \(\widehat{\beta}_{p \times 1}\) is a random variable; it has a variance and a covariance with each other element.
  • This is given by the variance-covariance matrix of \(\widehat{\beta}\)

\[\scriptsize{\begin{pmatrix} Var(\beta_1) & Cov(\beta_2,\beta_1) & Cov(\beta_3,\beta_1) &\ldots & Cov(\beta_p,\beta_1) \\ Cov(\beta_1,\beta_2) & Var(\beta_2) & Cov(\beta_3,\beta_2) & \ldots & Cov(\beta_p,\beta_2) \\ Cov(\beta_1,\beta_3) & Cov(\beta_2,\beta_3) & Var(\beta_3) & \ldots & Cov(\beta_p,\beta_3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Cov(\beta_1,\beta_p) & Cov(\beta_2,\beta_p) & Cov(\beta_3, \beta_p) & \ldots & Var(\beta_p)\end{pmatrix}}\]

  • Diagonal elements are the variances, off-diagonal elements are covariances; matrix is symmetric

Variance of OLS Estimator:

How do we use the variance-covariance matrix?

  • The square root of the diagonal elements (variances) gives the standard error for each parameter estimate in \(\widehat{\beta}\) (hypothesis testing)

  • The off-diagonal elements help answer questions like: is \(\beta_2 + \beta_3 \neq 0\)? We need \(Cov(\beta_2, \beta_3)\) to get \(Var(\beta_2 + \beta_3)\).
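A sketch of the second point in R (simulated data; the variable names and coefficients are invented for illustration):

```r
# Using vcov() to get Var(b2 + b3) (simulated, illustrative data)
set.seed(7)
d   <- data.frame(x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 0.5 * d$x2 - 0.3 * d$x3 + rnorm(100)
fit <- lm(y ~ x2 + x3, d)

V <- vcov(fit)                       # sigma2-hat * (X'X)^{-1}
# Var(b2 + b3) = Var(b2) + Var(b3) + 2 * Cov(b2, b3)
se_sum <- sqrt(V["x2", "x2"] + V["x3", "x3"] + 2 * V["x2", "x3"])
t_stat <- (coef(fit)["x2"] + coef(fit)["x3"]) / se_sum  # test H0: b2 + b3 = 0
```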

Variance-Covariance Matrix

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\), So:

\[cov(\widehat{\beta}|X) = E((\widehat{\beta} - \beta)(\widehat{\beta} - \beta)' | X)\]

\[ = E( ((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)' | X)\]

\[ = E( ((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon)(\epsilon'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}) | X)\]

\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

What really matters here is: \(E(\epsilon\epsilon'|X)\)

Variance-Covariance Matrix

\[E(\epsilon\epsilon'|X) = \sigma^2\mathbf{I}_{n\times n}\]

\(\epsilon\epsilon'\) is an \(n \times n\) matrix with \(ij\)th element \(\epsilon_i\epsilon_j\)

  • this is the variance-covariance matrix of the errors (we don’t subtract means because \(E(\epsilon) = 0\))

Taking the expectation:

  • If \(i \neq j\), the elements of the matrix are \(E(\epsilon_i\epsilon_j)\). Because, by assumption, the errors are independent of each other with \(E(\epsilon_i) = 0\), we get \(E(\epsilon_i\epsilon_j) = E(\epsilon_i)E(\epsilon_j) = 0 \times 0 = 0\)
  • but, for \(i = j\), \(E(\epsilon_i^2) = E[(\epsilon_i - E(\epsilon_i))^2]\) (a variance, since \(E(\epsilon_i) = 0\))
  • and because \(\epsilon\) is identically distributed, \(Var(\epsilon_i) = Var(\epsilon) = \sigma^2\) for all \(i\)

Variance-Covariance Matrix

\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2\mathbf{I}_{n\times n}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\[ = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\[cov(\widehat{\beta}|X) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]

Can we start computing standard errors for \(\widehat{\beta}\) now? What is missing?

Variance-Covariance Matrix

\(\widehat{\sigma^2} = \frac{\sum_{i=1}^n e_i^2}{n - p}\)

  • Why \(n-p\)? The variance of the sample mean required \(n-1\), because we calculated the mean first, so we lost 1 degree of freedom. With bivariate regression, we calculate \(\hat{\alpha}, \hat{\beta}\), so we lose 2 df. With \(p\) variables, we lose \(p\) degrees of freedom.

and \(E[\hat{\sigma^2} | X] = \sigma^2\) (unbiased)

so, we estimate the variance-covariance matrix:

\(\widehat{cov}(\widehat{\beta}) = \widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\)

Standard Errors of \(\hat{\beta}\)

The variance of \(\hat{\beta_p}\) (the \(p\)th diagonal element of \(\widehat{cov}(\widehat{\beta})\)) is

\(Var(\hat{\beta_p}) = \frac{\widehat{\sigma^2}}{\sum_{i=1}^n (X^*_{i} - \overline{X^*})^2}\)

So the standard error is:

\(SE(\hat{\beta_p}) = \sqrt{Var(\hat{\beta_p})} = \frac{\widehat{\sigma}}{\sqrt{n}\sqrt{Var(X_p^*)}}\)

  • standard errors get smaller (more precise) with: growing \(n\), increased (residual) variation in \(X_p\), and smaller error variance (\(\sigma^2\))

Standard errors

99 in control, 1 in treatment

## [1] 0.01

Standard errors

50 in control, 50 in treatment

## [1] 0.2525253
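The code behind these two slides is not shown; the printed values match the sample variance of the treatment indicator (my reconstruction, an assumption). With 99 control units and 1 treated unit there is almost no variation in \(X\), so the standard error of \(\hat{\beta}\) blows up; the 50/50 split maximizes the variance of \(X\) and so minimizes the standard error.

```r
# Sample variance of the treatment dummy under the two designs
x_unbalanced <- c(rep(0, 99), rep(1, 1))   # 99 control, 1 treated
var(x_unbalanced)   # 0.01

x_balanced <- c(rep(0, 50), rep(1, 50))    # 50 control, 50 treated
var(x_balanced)     # 0.2525253
```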

Standard errors

In sum:

  • We get \(\widehat{\sigma^2}\) from the residuals (\(e\)): the sum of squared residuals divided by \(n - p\)
  • We take the diagonal of the variance-covariance matrix \(\widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\) to get the variances of the coefficients
  • We take the square root of those variances to get standard errors for \(\widehat{\beta}\)

For these to be correct, we must assume:

  • \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  • \(\epsilon_i\) are independently, identically distributed, with variance \(\sigma^2\) for all \(i\)
  • \(X \mathrel{\unicode{x2AEB}} \epsilon\)

If any of these assumptions is wrong, \(\widehat{\beta}\) may be biased and the standard errors will be incorrect.

Using the Regression Model

Assumptions: Example

Hooke’s Law

  • \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  • seems reasonable, since the law is linear in the weights placed on the spring

  • \(\epsilon_i\) are independently, identically distributed, with variance \(\sigma^2\) for all \(i\)
  • are errors in measurement of length, weights, or other procedures that affect length of the spring independent and share the same variance?

  • \(X \mathrel{\unicode{x2AEB}} \epsilon\)
  • are the errors unrelated to the weight placed on the spring?

OLS Assumptions:

When do things go wrong?

  • We incorrectly specify the model (it is multiplicative, or we forgot a term):
    • \(Y_i = \alpha + \beta_1 x + \beta_2 x^2 + \epsilon\)
    • \(Y_i = \alpha + \beta_1 x + \beta_2 z + \epsilon\)
  • Errors are heteroskedastic: errors have different variances.
    • E.g.: at higher levels of education, incomes take on a wider range of values (think of the earning potential of PhD vs. MBA students), so the errors are larger.

OLS Assumptions:

  • Errors are not independent: errors for different observations are correlated.
    • E.g.: we observe the same counties over time. The error we observe for county \(i\) at time \(t\) is likely correlated with the error for county \(i\) at time \(t+1\), because it is the same place.
  • \(\epsilon\) is not independent of \(x\): the errors are in fact correlated with \(x\).
    • This is “confounding”. Our estimate \(\hat{\beta}\) of the coefficient on \(x\) will be biased. If we incorrectly excluded \(z\) from the equation, the bias will be a function of the relationships between \(x\) and \(z\), and between \(z\) and \(y\).
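Confounding can be seen in a short simulation (all coefficients here are invented for illustration): \(z\) drives both \(x\) and \(y\), and omitting it biases the coefficient on \(x\).

```r
# Omitted variable bias sketch (illustrative coefficients)
set.seed(3)
n <- 1000
z <- rnorm(n)
x <- 0.8 * z + rnorm(n)            # x is correlated with the confounder z
y <- 1 + 2 * x + 3 * z + rnorm(n)  # true coefficient on x is 2

coef(lm(y ~ x))["x"]       # biased upward: absorbs part of z's effect
coef(lm(y ~ x + z))["x"]   # close to the true value, 2
```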

Hypothesis Tests

\(H_0: \beta = \beta^*\) versus \(H_a: \beta \neq \beta^*\)

Typically, we take \(\beta^* = 0\), but in principle it can be any value.

Because we estimate the standard errors, use a \(t\) statistic:

\(t_{n-p} = \frac{\hat{\beta} - \beta^*}{se(\hat{\beta})}\)

Then obtain a \(p\) value (one-sided or two-sided, depending on \(H_a\))

  • if \(|t| > 1.96\), then \(p < 0.05\) (two-sided, with large degrees of freedom)
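Computed by hand (the coefficient and standard error are illustrative, and the degrees of freedom are a hypothetical \(n - p\)):

```r
# t statistic and two-sided p value for H0: beta = 0 (illustrative numbers)
beta_hat <- -0.106
se_beta  <- 0.036
df       <- 380 - 2                # hypothetical n - p

t_stat <- (beta_hat - 0) / se_beta
p_val  <- 2 * pt(-abs(t_stat), df = df)
```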

Sampling Distribution of \(\hat{\beta}\)

If we are using a \(t\)-test, don’t we need to make assumptions about asymptotic normality? Yes.

Two approaches

  1. If we assume \(\epsilon\) is normally distributed, \(\epsilon \sim N(0, \sigma^2)\), then the sampling distribution of \(\hat{\beta}\) is exactly normal.

  2. If \(\epsilon\) is not normally distributed: under plausible assumptions about the data, and assuming observations are independent, the sampling distribution of \(\hat{\beta}\) approaches a normal distribution as \(n \to \infty\) by the central limit theorem. This approximation works well if \(n\) is moderately large and the data are not highly non-normal. Failing that, we can always bootstrap.
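A minimal bootstrap sketch (simulated data with deliberately non-normal errors): resample rows with replacement, refit, and take the standard deviation of the resulting \(\hat{\beta}\)s as the standard error.

```r
# Bootstrap standard error for beta-hat (illustrative simulated data)
set.seed(9)
n   <- 200
d   <- data.frame(x = rnorm(n))
d$y <- 1 + 0.5 * d$x + (rexp(n) - 1)   # skewed errors with mean 0

boot_betas <- replicate(2000, {
  idx <- sample(n, replace = TRUE)     # resample rows with replacement
  coef(lm(y ~ x, data = d[idx, ]))["x"]
})
sd(boot_betas)                         # bootstrap SE(beta-hat)
```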

An Example

Regress Brexit Vote on Rainfall

An Example

#Make design-matrix
X = cbind(1, brexit$total_precip_polling)
#Make outcome matrix
Y = brexit$Pct_Lev

#Solve for Beta
beta_mat = solve(t(X) %*% X) %*% t(X) %*% Y

#Create y_hat
y_hat = X %*% beta_mat

#Create residuals
e = Y - y_hat

An Example

#Estimate sigma squared
sigma2_hat = sum(e^2) / (nrow(X) - ncol(X))

#Estimate variance-covariance
var_covar = sigma2_hat * solve(t(X) %*% X)

var_covar
##             [,1]         [,2]
## [1,]  0.39786504 -0.012415453
## [2,] -0.01241545  0.001306648

An Example

#beta standard errors:
beta_ses = var_covar %>% diag %>% sqrt
#get coefficient, standard errors, and t-stats
cbind(beta_mat, beta_ses, beta_mat / beta_ses)
##                   beta_ses          
## [1,] 54.1460994 0.63076544 85.841894
## [2,] -0.1058053 0.03614759 -2.927036
#check against `lm`
lm(Pct_Lev ~ total_precip_polling, brexit) %>% summary %>% .$coefficients
##                        Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)          54.1460994 0.63076544 85.841894 5.311451e-250
## total_precip_polling -0.1058053 0.03614759 -2.927036  3.628998e-03

An Example

Let’s interpret the hypothesis test:

##                        Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)          54.1460994 0.63076544 85.841894 5.311451e-250
## total_precip_polling -0.1058053 0.03614759 -2.927036  3.628998e-03

For the intercept: is it “significant”? Is this informative in any meaningful way?

For the slope: is it “significant”? Is this informative in any meaningful way?

An Example

Does the regression model apply in this context?

Is the model: \(Brexit \ Leave \ \%_i= \alpha + \beta \ mmRainfall_i + \epsilon_i\) correct?

Is \(\epsilon\) i.i.d. with mean \(0\) and variance \(\sigma^2\)?

Is \(\epsilon_i\) independent of \(mmRainfall_i\)?

Key Takeaways:

Given these assumptions (sometimes called Gauss-Markov assumptions)

  1. \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  2. \(\epsilon_i\) are independently, identically distributed, with variance \(\sigma^2\)
  3. \(X \mathrel{\unicode{x2AEB}} \epsilon\)

we can estimate \(\beta\) using \(\widehat{\beta}\) from least squares, and this estimate will be

  • unbiased
  • its variance can also be estimated

But are these assumptions reasonable (compared to Neyman causal model)?